Syntactic N-grams as machine learning features for natural language processing

Authors

Grigori Sidorov

Francisco Velasquez

Efstathios Stamatatos

Alexander F. Gelbukh

Liliana Chanona-Hernández

Abstract

In this paper we introduce and discuss the concept of syntactic n-grams (sn-grams). Sn-grams differ from traditional n-grams in how they are constructed, i.e., in which elements are considered neighbors. For sn-grams, neighbors are determined by following syntactic relations in syntactic trees rather than by taking words in the order they appear in a text; that is, sn-grams are constructed by following paths in syntactic trees. In this way, sn-grams bring syntactic knowledge into machine learning methods, although the text must be parsed beforehand to construct them. Sn-grams can be applied in any NLP task where traditional n-grams are used. We describe how sn-grams were applied to authorship attribution. As a baseline we used traditional n-grams of words, POS tags, and characters; three classifiers were applied: SVM, NB, and J48. Sn-grams give better results with the SVM classifier.
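The construction described in the abstract (following paths in a syntactic tree instead of reading words left to right) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `syntactic_ngrams`, the head-to-dependents dictionary, and the example sentence with its hand-written dependency tree are all assumptions made for illustration. In practice the tree would come from a parser, and words would be keyed by position rather than by surface form so that repeated words do not collide.

```python
from typing import Dict, List, Tuple


def syntactic_ngrams(children: Dict[str, List[str]], n: int) -> List[Tuple[str, ...]]:
    """Collect every downward path of exactly n words in a dependency tree.

    `children` maps a head word to its direct dependents (hypothetical
    representation; assumes every word in the sentence is unique).
    """

    def paths_from(node: str, length: int) -> List[Tuple[str, ...]]:
        # A path of length 1 is just the node itself.
        if length == 1:
            return [(node,)]
        result = []
        # Extend the path through each dependent of the current node.
        for child in children.get(node, []):
            for tail in paths_from(child, length - 1):
                result.append((node,) + tail)
        return result

    # Gather all words: heads and dependents.
    nodes = set(children)
    for deps in children.values():
        nodes.update(deps)

    ngrams: List[Tuple[str, ...]] = []
    for node in sorted(nodes):
        ngrams.extend(paths_from(node, n))
    return ngrams


# Hand-written (assumed) dependency tree for "the old dog chased a cat".
tree = {
    "chased": ["dog", "cat"],
    "dog": ["the", "old"],
    "cat": ["a"],
}

sn_bigrams = syntactic_ngrams(tree, 2)
sn_trigrams = syntactic_ngrams(tree, 3)

# Traditional word bigrams, for comparison: neighbors by surface order.
words = "the old dog chased a cat".split()
word_bigrams = list(zip(words, words[1:]))
```

In this toy tree, `("chased", "cat")` is an sn-gram even though the two words are not adjacent in the sentence, while the surface bigram `("chased", "a")` is not, because no syntactic relation links those two words directly.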


Similar resources

A Novel Approach to Conditional Random Field-based Named Entity Recognition using Persian Specific Features

Named Entity Recognition is an information extraction technique that identifies named entities in a text. Three approaches have conventionally been used to extract named entities from a text: rule-based, machine-learning-based, and hybrids of the two. Machine-learning-based methods perform well on the Persian language if they are trained with good features. To get good performanc...


Syntactic Dependency-Based N-grams: More Evidence of Usefulness in Classification

The paper introduces and discusses the concept of syntactic n-grams (sn-grams), which can be applied instead of traditional n-grams in many NLP tasks. Sn-grams are constructed by following paths in syntactic trees, so sn-grams bring syntactic knowledge into machine learning methods. However, the text must be parsed beforehand to construct them. We applied sn-grams in the task of authorship at...


N-gramas sintácticos no-continuos

In this paper, we present the concept of non-continuous syntactic n-grams. In our previous work we introduced the general concept of syntactic n-grams, i.e., n-grams that are constructed by following paths in syntactic trees. Their great advantage is that they allow purely linguistic (syntactic) information to be introduced into machine learning methods. A certain disadvantage is that previous ...


Soft Similarity and Soft Cosine Measure: Similarity of Features in Vector Space Model

We show how to consider similarity between features for calculation of similarity of objects in the Vector Space Model (VSM) for machine learning algorithms and other classes of methods that involve similarity between objects. Unlike LSA, we assume that similarity between features is known (say, from a synonym dictionary) and does not need to be learned from the data. We call the proposed...


Exploring Lexical and Syntactic Features for Language Variety Identification

We present a method to discriminate between texts written in either the Netherlandic or the Flemish variant of the Dutch language. The method draws on a feature bundle representing text statistics, syntactic features, and word n-grams. Text statistics include average word length and sentence length, while syntactic features include ratios of function words and part-of-speech n-grams. The effecti...



Journal:

Expert Syst. Appl.

Volume 41

Publication date: 2014
